    AmodalSynthDrive: A Synthetic Amodal Perception Dataset for Autonomous Driving

    Unlike humans, who can effortlessly estimate the entirety of objects even when they are partially occluded, modern computer vision algorithms still find this task extremely challenging. Leveraging this amodal perception for autonomous driving remains largely untapped due to the lack of suitable datasets. The curation of such datasets is primarily hindered by significant annotation costs and by the difficulty of mitigating annotator subjectivity when labeling occluded regions. To address these limitations, we introduce AmodalSynthDrive, a synthetic multi-task, multi-modal amodal perception dataset. The dataset provides multi-view camera images, 3D bounding boxes, LiDAR data, and odometry for 150 driving sequences with over 1M object annotations in diverse traffic, weather, and lighting conditions. AmodalSynthDrive supports multiple amodal scene understanding tasks, including the newly introduced task of amodal depth estimation for enhanced spatial understanding. We evaluate several baselines for each of these tasks to illustrate the challenges and set up public benchmarking servers. The dataset is available at http://amodalsynthdrive.cs.uni-freiburg.de
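
    Amodal annotations pair each object's visible mask with its full (amodal) extent, including the occluded part. The minimal sketch below is not the dataset's own API (mask names and shapes are assumptions); it only shows how an occlusion ratio can be derived from such a pair of masks:

```python
import numpy as np

def occlusion_ratio(amodal_mask: np.ndarray, visible_mask: np.ndarray) -> float:
    """Fraction of an object's full (amodal) extent that is occluded.

    Both masks are boolean arrays of shape (H, W); the visible mask is
    assumed to be a subset of the amodal mask.
    """
    amodal_area = amodal_mask.sum()
    if amodal_area == 0:
        return 0.0
    hidden = np.logical_and(amodal_mask, np.logical_not(visible_mask)).sum()
    return float(hidden) / float(amodal_area)

# Toy example: a 5x5 object whose right two columns are occluded.
amodal = np.zeros((10, 10), dtype=bool)
amodal[2:7, 2:7] = True
visible = amodal.copy()
visible[:, 5:7] = False
print(occlusion_ratio(amodal, visible))  # 0.4
```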

    Amodal Optical Flow

    Optical flow estimation is very challenging in situations with transparent or occluded objects. In this work, we address these challenges at the task level by introducing Amodal Optical Flow, which integrates optical flow with amodal perception. Instead of only representing the visible regions, we define amodal optical flow as a multi-layered pixel-level motion field that encompasses both visible and occluded regions of the scene. To facilitate research on this new task, we extend the AmodalSynthDrive dataset to include pixel-level labels for amodal optical flow estimation. We present several strong baselines, along with the Amodal Flow Quality metric to quantify the performance in an interpretable manner. Furthermore, we propose the novel AmodalFlowNet as an initial step toward addressing this task. AmodalFlowNet consists of a transformer-based cost-volume encoder paired with a recurrent transformer decoder which facilitates recurrent hierarchical feature propagation and amodal semantic grounding. We demonstrate the tractability of amodal optical flow in extensive experiments and show its utility for downstream tasks such as panoptic tracking. We make the dataset, code, and trained models publicly available at http://amodal-flow.cs.uni-freiburg.de
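
    The Amodal Flow Quality metric itself is defined in the paper; the sketch below only illustrates the kind of data structure the task implies (a stack of per-layer flow fields, each with an amodal validity mask) and a plain per-layer endpoint error. All names and shapes here are assumptions, not the released code:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class FlowLayer:
    """One layer of an amodal flow field: per-pixel motion over the layer's
    full (amodal) extent, whether currently visible or occluded."""
    flow: np.ndarray  # (H, W, 2) forward motion in pixels (dx, dy)
    mask: np.ndarray  # (H, W) bool, the layer's amodal extent

def layer_endpoint_error(pred: FlowLayer, gt: FlowLayer) -> float:
    """Average endpoint error over the ground-truth layer extent."""
    valid = gt.mask
    if not valid.any():
        return 0.0
    diff = pred.flow[valid] - gt.flow[valid]
    return float(np.linalg.norm(diff, axis=-1).mean())
```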

    The OmniScape Dataset

    Despite the utility and benefits of omnidirectional images in robotics and automotive applications, no datasets of omnidirectional images are available with semantic segmentation, depth maps, and dynamic properties. This is due to the time and human effort required to annotate ground truth images. This paper presents a framework for generating omnidirectional images from images acquired in a virtual environment. We demonstrate the relevance of the proposed framework on two well-known simulators: the CARLA simulator, an open-source simulator for autonomous driving research, and Grand Theft Auto V (GTA V), a very high-quality video game. We describe the generated OmniScape dataset in detail; it includes stereo fisheye and catadioptric images acquired from the two front sides of a motorcycle, together with semantic segmentation, depth maps, the intrinsic parameters of the cameras, and the dynamic parameters of the motorcycle. It is worth noting that the case of two-wheeled vehicles is more challenging than that of cars due to the specific dynamics of these vehicles.
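
    The exact camera models used to render the OmniScape views are specified in the paper. As a rough illustration only, the sketch below projects camera-frame 3D points with a simple equidistant fisheye model (r = f * theta), one common choice for such lenses; the function and parameter names are assumptions:

```python
import numpy as np

def project_equidistant_fisheye(points_cam, f, cx, cy):
    """Project 3D points in the camera frame (X right, Y down, Z forward)
    with an equidistant fisheye model: radius r = f * theta.

    points_cam: (N, 3) array; returns (N, 2) pixel coordinates.
    """
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    theta = np.arctan2(np.hypot(X, Y), Z)   # angle from the optical axis
    phi = np.arctan2(Y, X)                  # azimuth around the optical axis
    r = f * theta                           # equidistant mapping
    return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)
```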

    Génération d'images omnidirectionnelles à partir d'un environnement virtuel

    This paper describes a method for generating omnidirectional images using cubemap images and corresponding depth maps acquired from a virtual environment. For this purpose, we use the video game Grand Theft Auto V (GTA V). GTA V has been used as a data source in many research projects because it is a hyperrealistic open-world game that simulates a real city. We take advantage of progress made in reverse engineering this game to extract realistic images and corresponding depth maps using virtual cameras with six degrees of freedom. By combining the extracted information with an omnidirectional camera model, we generate fisheye images intended, for instance, for machine-learning applications.
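
    The core of such a pipeline is resampling a cubemap along the ray directions of the target fisheye image. The sketch below is a simplified illustration under an assumed equidistant model and a simplified cube-face orientation convention (real cubemap layouts differ per engine); it is not the paper's implementation:

```python
import numpy as np

def fisheye_rays(width, height, fov_deg):
    """Unit ray directions for an equidistant fisheye image with the given FOV."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    x, y = u - cx, v - cy
    r = np.hypot(x, y)
    theta = (r / min(cx, cy)) * np.radians(fov_deg) / 2.0  # angle grows with radius
    phi = np.arctan2(y, x)
    sin_t = np.sin(theta)
    # Camera frame: X right, Y down, Z forward.
    return np.stack([sin_t * np.cos(phi), sin_t * np.sin(phi), np.cos(theta)], axis=-1)

def sample_cubemap(faces, rays):
    """Nearest-neighbour lookup of ray directions in a cubemap.

    faces: dict mapping '+x', '-x', '+y', '-y', '+z', '-z' to (S, S, 3) images.
    Face orientation conventions are deliberately simplified here.
    """
    S = faces['+z'].shape[0]
    out = np.zeros(rays.shape[:2] + (3,), dtype=faces['+z'].dtype)
    absd = np.abs(rays)
    for name, axis, sign in [('+x', 0, 1), ('-x', 0, -1), ('+y', 1, 1),
                             ('-y', 1, -1), ('+z', 2, 1), ('-z', 2, -1)]:
        sel = (absd.argmax(-1) == axis) & (np.sign(rays[..., axis]) == sign)
        if not sel.any():
            continue
        d = rays[sel] / absd[sel, axis][:, None]   # project rays onto the face plane
        other = [i for i in range(3) if i != axis]
        uu = ((d[:, other[0]] + 1) / 2 * (S - 1)).astype(int)
        vv = ((d[:, other[1]] + 1) / 2 * (S - 1)).astype(int)
        out[sel] = faces[name][vv, uu]
    return out
```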

    A comparative study of semantic segmentation using omnidirectional images

    The semantic segmentation of omnidirectional urban driving images is a research topic that has increasingly attracted the attention of researchers. This paper presents a thorough comparative study of different neural network models trained on four different representations: perspective, equirectangular, spherical, and fisheye. In this study we use real perspective images; synthetic perspective, fisheye, and equirectangular images; and a test set of real fisheye images. We evaluate the performance of convolutions on spherical images and on perspective images. Analyzing the results of this study yields several conclusions that help in understanding how different networks learn to deal with omnidirectional distortions. Our main finding is that models trained on omnidirectional images are robust to modality changes and are able to learn a universal representation, giving good results on both perspective and omnidirectional images. The relevance of all results is examined with an analysis of quantitative measures.
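
    A common way to compare models trained on different representations is per-class intersection-over-union averaged over classes. The short sketch below is a generic mIoU implementation for illustration, not the evaluation code used in the study:

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean intersection-over-union between two integer label maps of equal shape."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:              # class absent from both maps: skip it
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```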

    A comparative study of semantic segmentation of omnidirectional images from a motorcycle perspective

    The semantic segmentation of omnidirectional urban driving images is a research topic that has increasingly attracted the attention of researchers, because such images are highly relevant to driving scenes. However, the case of motorized two-wheelers has not yet been treated. Since the dynamics of these vehicles are very different from those of cars, we focus our study on images acquired from a motorcycle. This paper provides a thorough comparative study of how different deep learning approaches handle omnidirectional images with different representations, including perspective, equirectangular, spherical, and fisheye, and presents the best solution for segmenting omnidirectional road scene images. In this study we use real perspective images; synthetic perspective, fisheye, and equirectangular images; simulated fisheye images; and a test set of real fisheye images. Analyzing both qualitative and quantitative results leads to several conclusions and helps in understanding how the networks learn to deal with omnidirectional distortions. Our main findings are that models with planar convolutions give better results than those with spherical convolutions, and that models trained on omnidirectional representations transfer better to standard perspective images than vice versa.
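
    Evaluating transfer between representations requires mapping between them; the sketch below renders a pinhole (perspective) view from an equirectangular image with nearest-neighbour sampling. It is only an illustration of the projection, with assumed axis and orientation conventions, not the conversion code used in the study:

```python
import numpy as np

def equirect_to_perspective(equi, out_w, out_h, fov_deg, yaw=0.0, pitch=0.0):
    """Render a pinhole view from an equirectangular image (nearest neighbour)."""
    H, W = equi.shape[:2]
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)   # focal length from horizontal FOV
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    # Ray directions in the virtual pinhole camera (X right, Y down, Z forward).
    d = np.stack([(u - out_w / 2) / f, (v - out_h / 2) / f, np.ones_like(u, float)], -1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)
    # Rotate the view by yaw (around Y) and pitch (around X).
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    d = d @ (Ry @ Rx).T
    lon = np.arctan2(d[..., 0], d[..., 2])            # longitude in [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))        # latitude in [-pi/2, pi/2]
    x = ((lon / np.pi + 1) / 2 * (W - 1)).astype(int)
    y = ((lat / (np.pi / 2) + 1) / 2 * (H - 1)).astype(int)
    return equi[y, x]
```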

    TPT-Dance&Actions: a multimodal corpus of human activities

    We present TPT - Dance & Actions, a new multimodal database of human activities for research in multimodal scene analysis and understanding. The corpus focuses on dance scenes (lindy hop, salsa, and classical dance) and music-driven fitness scenes, and also includes recordings of more conventional isolated actions. It comprises 14 dance choreographies and 13 sequences of other multimodal activities, performed by 20 dancers and 16 participants respectively. These scenes were recorded with a variety of sensor networks: video cameras, depth sensors, microphones, piezoelectric transducers, and wearable inertial devices (accelerometers, gyroscopes, and magnetometers). The data will be made freely available online for research purposes. Keywords: dance, lindy hop, salsa, classical dance, fitness, isolated actions, multimodal data, audio, video, depth maps, inertial data, synchronization, multimodal activity analysis, gesture and action recognition.

    Fully residual Unet-based semantic segmentation of automotive fisheye images: a comparison of rectangular and deformable convolutions

    Semantic image segmentation is an essential task for autonomous vehicles and self-driving cars, where complete and real-time perception of the surroundings is mandatory. Convolutional neural network (CNN) approaches for semantic segmentation stand out over other state-of-the-art solutions due to their strong generalization to unknown data and end-to-end training. Fisheye images are important due to their large field of view and their ability to reveal information from broader surroundings. Nevertheless, they pose unique challenges for CNNs because of the object distortion introduced by the fisheye lens, which varies with object position. In addition, the large annotated fisheye datasets required for CNN training are rather scarce. In this paper, we investigate the use of deformable convolutions to accommodate distortions in fisheye image segmentation with a fully residual U-Net, learning unknown geometric transformations via filters of variable shape and size. The proposed models and integration strategies are explored within two main paradigms: single (front)-view and multi-view fisheye image segmentation. The proposed methods are validated on synthetic and real fisheye images from the WoodScape and SynWoodScape datasets. The results confirm the value of the deformable fully residual U-Net structure for learning unknown geometric distortions in both paradigms, demonstrate the possibility of learning view-agnostic distortion properties when training on multi-view data, and shed light on the role of surround-view images in increasing segmentation performance relative to the single view. Finally, our experiments suggest that deformable convolutions are a powerful tool for increasing the effectiveness of fully residual U-Nets for semantic segmentation of automotive fisheye images.
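
    A minimal sketch of the idea, using torchvision's DeformConv2d: a residual block in which the second convolution samples at offsets predicted from the features, so the receptive field can adapt to fisheye distortion. This is an assumed, simplified block for illustration, not the architecture from the paper:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableResidualBlock(nn.Module):
    """Residual block whose second conv samples at learned offsets,
    letting the receptive field adapt to fisheye distortion."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, padding=pad)
        # Offsets: 2 coordinates per kernel sample, predicted per output pixel.
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=pad)
        self.conv2 = DeformConv2d(channels, channels, kernel_size, padding=pad)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        offsets = self.offset(out)
        out = self.bn2(self.conv2(out, offsets))
        return self.relu(out + x)          # residual (identity) connection

# Quick shape check on a dummy feature map.
block = DeformableResidualBlock(32)
print(block(torch.randn(1, 32, 64, 64)).shape)  # torch.Size([1, 32, 64, 64])
```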

    SynWoodScape: Synthetic Surround-View Fisheye Camera Dataset for Autonomous Driving

    Surround-view cameras are a primary sensor for automated driving, used for near-field perception. They are among the most commonly used sensors in commercial vehicles, primarily for parking visualization and automated parking. Four fisheye cameras with a 190° field of view cover the full 360° around the vehicle. Due to their high radial distortion, standard algorithms do not extend to them easily. Previously, we released the first public fisheye surround-view dataset, named WoodScape. In this work, we release a synthetic version of the surround-view dataset that addresses many of its weaknesses and extends it. Firstly, it is not possible to obtain ground truth for pixel-wise optical flow and depth in the real dataset. Secondly, WoodScape did not have all four cameras annotated simultaneously, in order to sample diverse frames; this means that multi-camera algorithms cannot be designed to obtain a unified output in bird's-eye-view space, which the new dataset enables. We implemented surround-view fisheye geometric projections in the CARLA simulator matching WoodScape's configuration and created SynWoodScape. We release 80k images from the synthetic dataset with annotations for 10+ tasks. We also release the baseline code and supporting scripts.
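
    Reproducing a fisheye configuration in a simulator amounts to applying the target cameras' projection model to rendered geometry. The sketch below assumes a polynomial radial model of the form r(theta) = k1*theta + k2*theta^2 + k3*theta^3 + k4*theta^4, a common parameterization for such lenses; the coefficients, names, and conventions are illustrative assumptions, not the dataset's published calibration:

```python
import numpy as np

def project_polynomial_fisheye(points_cam, k, cx, cy):
    """Project camera-frame 3D points (X right, Y down, Z forward) with a
    polynomial radial model: r(theta) = sum_i k_i * theta**i, i = 1..4.

    points_cam: (N, 3) array; k: iterable of four coefficients; returns (N, 2) pixels.
    """
    X, Y, Z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    theta = np.arctan2(np.hypot(X, Y), Z)            # angle from the optical axis
    r = sum(ki * theta ** (i + 1) for i, ki in enumerate(k))
    phi = np.arctan2(Y, X)                           # azimuth around the optical axis
    return np.stack([cx + r * np.cos(phi), cy + r * np.sin(phi)], axis=-1)
```

    The equidistant model sketched for the OmniScape entry above is the special case k = (f, 0, 0, 0).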